Grant AFOSR-88-0227: Final Technical Report


The research efforts supported by AFOSR Grant AFOSR-88-0227 concentrated on
three sorts of neural network problems. First, we studied the representation problem for the
class of single-hidden-layer feedforward networks, which is fundamental for understanding
limitations of learning algorithms, and which also contributes to understanding the behavior
of learning algorithms in applications involving low-complexity networks. The second kind of
problem studied concerns dynamic behavior in neural networks containing feedback
(trellis-structured networks in one particular application). Our work focused on studying stability
issues and exploring the applications of computational complexity theory. Third, the PAC
learning paradigm (Probably Almost Correct) was analyzed with the goal of characterizing
the effects of statistically dependent sequences of training examples on learning performance.
The goal of all of these efforts was to discover and explore insights about fundamental
limitations on the computational capabilities of analog neural systems and, where possible, of more
general classes of physical systems as well.

We turn first to a discussion of our work on the representation problem. Despite its
importance and, as it turns out, its amenability to study with familiar tools from functional
analysis, the characterization of functions that may be represented by single-hidden-layer
feedforward (SHLFF) neural networks has only very recently been rigorously determined.
That is not to say that the answer was unexpected! Because of extensive empirical evidence,
reinforced by powerful mathematical results like Kolmogorov’s representation for functions
of several variables, it was reasonable to believe that the class of functions implemented by
SHLFF neural networks was at least a “large” subset of some class of well-behaved
functions. Using the Hahn-Banach theorem on a suitable formalization of this problem, George
Cybenko gave an elegant proof that the SHLFF neural networks generate a dense set of
functions in suitable topologies, and similar results have been obtained by Funahashi, by
Hornik, Stinchcombe, and White, and by Jones and Barron. The result holds for a wide class
of neuron (sigmoidal) nonlinearities, but we wish to emphasize that once a particular choice
of nonlinearity has been made, every node of the network uses it (which is in distinction to
results which follow from Kolmogorov’s representation). Thus the functions which are dense
are finite linear combinations of the form


$$
f_{\mathrm{SHLFF}}(x_1, \ldots, x_n) = \sum_{i=1}^{N} \alpha_i \, \sigma\!\left( \sum_{k=1}^{n} w_{ik} x_k + \theta_i \right)
$$


The “price” of elegance in Cybenko’s approach (and in all the others mentioned except
for the very recent work of Jones and Barron) is its existential, rather than constructive,
solution to the problem of approximation. Our work has attacked the approximation
problem directly using techniques related to the Radon transform and reconstruction from
projections. (These tools are familiar ones in the signal processing literature dealing with
tomography, where functions of 2 and 3 spatial variables arise.) This allows us to develop error
bounds which relate the number of nodes in the required hidden layer, N in the above
equation, to smoothness properties of the function being approximated and to the quality of the
desired approximation. Furthermore, the calculations required to carry out the
approximation are standard ones from signal processing; an approximation for the “toy” XOR function
can even be carried out analytically. A paper describing the constructive approximation
approach to representation by SHLFF neural networks was presented at the International
Joint Conference on Neural Networks in Washington, June 1989 [1]. In view of the recent
work of Jones and Barron, it would be of interest to investigate how the error bound in [1]
might be sharpened, since at least under some assumptions about regularity of the underlying
function, approximations accurate to order 1/N, i.e., accuracy inversely proportional to the
number of hidden nodes, can be obtained.
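
As a concrete illustration of a finite linear combination of sigmoids, the following minimal sketch evaluates a two-node SHLFF network on the XOR inputs. The weights are hand-picked for illustration and are not the analytical construction of [1]; the names `alpha`, `W`, `theta`, and the gain value of 20 are assumptions introduced here.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def shlff(x, alpha, W, theta):
    """Single-hidden-layer output: sum_i alpha[i] * sigmoid(W[i] . x + theta[i])."""
    return alpha @ sigmoid(W @ x + theta)

# Hypothetical hand-picked parameters (not taken from the report): both hidden
# nodes see the projection x1 + x2, and their difference forms a bump that is
# close to 1 only when exactly one input equals 1.
GAIN = 20.0
alpha = np.array([1.0, -1.0])
W = GAIN * np.array([[1.0, 1.0], [1.0, 1.0]])
theta = GAIN * np.array([-0.5, -1.5])

def xor_approx(x1, x2):
    """Approximate XOR of two binary inputs with the two-node network."""
    return shlff(np.array([x1, x2], dtype=float), alpha, W, theta)
```

Both hidden nodes share a single projection direction, so this toy case reduces to a one-dimensional approximation problem along the line x1 + x2.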

Our work involving extensions of the constructive approximation has concerned a set
of more “practical” issues related to the numerical conditioning of approximation methods
based on the SHLFF architecture. In this later work [2], the analysis of the underlying
one-dimensional approximation problems was considered. Some interesting conclusions were
reached. First, the inherent advantage of the sigmoidal nonlinearity used in neural networks
over classical polynomial nonlinearities is clearly revealed. The inability of polynomial
interpolating functions to give uniformly good function approximations (indeed, the sup norm of
the approximation error can grow exponentially fast with the number of interpolating
points!) is well known in approximation theory. This is one of the strongest motivations for
the use of interpolating splines for function approximation. With splines, the sup norm of
the approximation error decreases as more interpolating points are added; the rate of
improvement in approximation depends on the smoothness of the underlying function.
Furthermore, linear and cubic splines can be uniformly approximated with a logistic
sigmoidal function of the kind commonly used in neural networks, and thus the sigmoidal
neural networks inherit the excellent approximation capabilities that are associated with
splines.
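
The contrast between polynomial and spline interpolation can be checked numerically. The sketch below is only an illustration under stated assumptions: the test function (Runge's classical example), the equispaced interpolation points, and all function names are choices made here, not taken from [2]. It compares the sup-norm error of the interpolating polynomial with that of a linear spline through the same points.

```python
import numpy as np

def runge(x):
    """Runge's classical test function on [-1, 1] (an assumed example)."""
    return 1.0 / (1.0 + 25.0 * x**2)

def lagrange_interpolate(nodes, values, x):
    """Evaluate the interpolating polynomial through (nodes, values) at points x."""
    result = np.zeros_like(x)
    for j, (xj, fj) in enumerate(zip(nodes, values)):
        others = np.delete(nodes, j)
        # Lagrange basis polynomial for node j, evaluated at every x.
        basis = np.prod((x[:, None] - others) / (xj - others), axis=1)
        result += fj * basis
    return result

def sup_errors(num_points, grid_size=2001):
    """Return (polynomial, linear-spline) sup-norm errors on a dense grid."""
    nodes = np.linspace(-1.0, 1.0, num_points)
    values = runge(nodes)
    grid = np.linspace(-1.0, 1.0, grid_size)
    exact = runge(grid)
    poly_err = np.max(np.abs(lagrange_interpolate(nodes, values, grid) - exact))
    spline_err = np.max(np.abs(np.interp(grid, nodes, values) - exact))
    return poly_err, spline_err
```

Calling `sup_errors(11)` and then `sup_errors(21)` shows the polynomial error growing while the linear-spline error shrinks as points are added.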

A second conclusion duplicates one obtained earlier in [1]. A fixed set of input/hidden
layer connection weights may be preselected (once the number of hidden nodes is chosen and
the approximation error bound is fixed) without sacrificing the “universal approximation”
capabilities of the network. In terms of the analogy with spline approximation, this
corresponds to the fact that good approximation can be achieved with preselected
interpolation points (knots). Consequently, the SHLFF network may basically take the form of a
CMAC network, where input/hidden layer structure and weights remain fixed while
hidden/output layer weights are adjusted (perhaps through a training procedure or other
learning mechanism) in order to give a good fit based on sample values of the function. In the
case of mean square error approximation, the main advantage is that no “error
backpropagation” is required for training, and this means that the amount of computation required for
training can be drastically reduced.
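
The fixed-hidden-layer idea can be sketched as an ordinary linear least-squares problem. In the example below the input/hidden weights (a common gain and a set of preselected shifts) are held fixed and only the hidden/output weights are solved for, with no backpropagation. The target function, the number of nodes, the gain, and all names are assumptions chosen for illustration.

```python
import numpy as np

def hidden_features(x, centers, gain):
    """Outputs of the fixed hidden layer, plus a constant column for an output bias."""
    activations = 1.0 / (1.0 + np.exp(-gain * (x[:, None] - centers[None, :])))
    return np.hstack([activations, np.ones((x.size, 1))])

def fit_output_weights(x, y, centers, gain):
    """Solve the mean-square-error problem for the hidden/output weights only."""
    features = hidden_features(x, centers, gain)
    weights, *_ = np.linalg.lstsq(features, y, rcond=None)
    return weights

def predict(x, weights, centers, gain):
    """Network output for inputs x using the fixed hidden layer and fitted weights."""
    return hidden_features(x, centers, gain) @ weights

# Hypothetical setup: 20 preselected shifts ("knots") on [-1, 1], smooth target.
centers = np.linspace(-1.0, 1.0, 20)
GAIN = 10.0
x_train = np.linspace(-1.0, 1.0, 200)
y_train = np.sin(np.pi * x_train)
weights = fit_output_weights(x_train, y_train, centers, GAIN)

x_test = np.linspace(-1.0, 1.0, 401)
max_error = np.max(np.abs(predict(x_test, weights, centers, GAIN) - np.sin(np.pi * x_test)))
```

The single call to `np.linalg.lstsq` replaces iterative training in this setting, which is the computational saving the paragraph above refers to.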

A third significant result is an explicit motivation for the use of paired sigmoidal
functions, in the form of a difference, σ(x − L) − σ(x + L), rather than independent sigmoidal
functions, based on considerations of numerical conditioning of mean square error
approximation. While the condition number for approximation with sigmoidal functions grows only
linearly with the number of hidden nodes N, a constant condition number can be achieved
using paired sigmoids because the difference of two shifted sigmoids is effectively of finite
support. This intrinsic difficulty in the scaling behavior of training algorithms for SHLFF has
not been previously made explicit. Such algorithms should be expected to require a
logarithmically growing [several words illegible in the source]. It is
to be expected that the paired sigmoidal networks display much improved scaling behavior,
and they may be useful for empirical studies of the scaling of intrinsic difficulty of classes of
problems for which backpropagation training is routinely applied.
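
The conditioning claim can be explored numerically. The following sketch builds least-squares design matrices for independent shifted sigmoids and for paired sigmoids on [0, 1] and compares their condition numbers. The choice of shifts, the gain, the tie between L and the node spacing, and the function names are all assumptions made for this illustration, not the setup analyzed in [2].

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def design_matrices(num_nodes, grid_size=2000):
    """Return (independent, paired) design matrices sampled on [0, 1]."""
    z = np.linspace(0.0, 1.0, grid_size)[:, None]
    centers = (np.arange(num_nodes) + 0.5) / num_nodes
    half_width = 0.5 / num_nodes   # the shift L, tied here to the node spacing
    gain = 8.0 * num_nodes         # assumed sigmoid steepness
    independent = sigmoid(gain * (z - centers))
    # Difference of two shifted sigmoids: effectively of finite support.
    paired = (sigmoid(gain * (z - centers - half_width))
              - sigmoid(gain * (z - centers + half_width)))
    return independent, paired

def condition_numbers(num_nodes):
    """2-norm condition numbers of the (independent, paired) design matrices."""
    independent, paired = design_matrices(num_nodes)
    return np.linalg.cond(independent), np.linalg.cond(paired)
```

Evaluating `condition_numbers(8)` and `condition_numbers(32)` gives a rough picture of how the two families scale with the number of hidden nodes under these assumptions.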

Dynamical  behavior  of  neural  networks  involving  feedback  interconnections  has 
attracted  attention  for  a  variety  of  applications:  (Hopfield)  associative  memory,  optimization, 
and  efficient  realization  of  competitive  selection  processes  such  as  winner-take-all  networks. 
Studying  analog  neural  networks  amounts  to  applying  such  well-known  ideas  as  Lyapunov 
functions, gradient flows, etc., from differential equations and dynamical systems. Of some
interest is the limiting behavior of a parametrized family of analog neural networks which
are intended to approach a suitable discrete model, e.g., one employing binary threshold
elements rather than sigmoidal nonlinearities. Arguments about how analog behavior
approaches  some  kind  of  discrete  behavior  in  the  limit  of  infinite  gain  sigmoidal  nonlinearities 
have  been  advanced  and  experimentally  validated,  typically  using  a  sequence  of  “frozen”  gain 
(piecewise  constant  gain)  networks. 
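
The role of a Lyapunov function in a gradient flow can be seen in a very small example. The sketch below uses a quadratic energy V(x) = 0.5 xᵀAx as a stand-in (an assumption made here; it is not a neural network energy from the report) and integrates dx/dt = −∇V(x) with explicit Euler steps, recording V along the trajectory.

```python
import numpy as np

def simulate_gradient_flow(A, x0, step, num_steps):
    """Explicit-Euler discretization of dx/dt = -grad V(x), V(x) = 0.5 x^T A x."""
    x = np.asarray(x0, dtype=float)
    energies = [0.5 * x @ A @ x]
    for _ in range(num_steps):
        x = x - step * (A @ x)          # grad V(x) = A x for symmetric A
        energies.append(0.5 * x @ A @ x)
    return x, np.array(energies)

# Hypothetical symmetric positive definite matrix and initial state.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
# For a quadratic V, any step below 2 / lambda_max keeps V decreasing.
step = 1.0 / np.linalg.eigvalsh(A).max()
x_final, energies = simulate_gradient_flow(A, [1.0, -2.0], step, 50)
```

The recorded `energies` sequence is nonincreasing, which is the discrete counterpart of V serving as a Lyapunov function for the continuous-time flow.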

Our work focused mainly on an interesting stability problem for trellis networks
involving a sequence of identical layers with a fixed pattern of feedforward connections between
layers and with “on-center, off-surround” competitive feedback interconnections within each
layer. Such networks were developed some time ago using maximum likelihood sequence
estimation for convolutional decoding as a motivation, and they have been shown to provide
error-correcting capabilities while allowing for an efficient means of incorporating fault
tolerance [3]. Besides an ongoing theoretical investigation of the stability issues involving trellis
networks, we have worked on exploring how this architecture may be extended to obtain the
capacity for “self-repair” in networks; our approach is to introduce one or more redundant
spare neurons at each stage of the trellis. It turns out that the self-repairing process can
proceed at the same time gradient-type training is being used. The automatic replacement
of faulty neurons by spares is accomplished at some loss in the speed of convergence of
learning, as would be expected. A paper discussing this work has appeared in the IEEE Trans. on
Neural Networks [3], and another survey of this general set of problems was presented at the
1990 Conference on Decision and Control [4].

As an attempt to exploit extra hardware for increasing the learning rate of feedforward
networks, we explored analogies with cooperating parallel systems occurring in nature. In
particular, a model inspired by the flocking behavior of birds and the schooling behavior of
fish has been formulated to investigate loosely-coupled parallel learning processes for SHLFF
networks according to a back-propagation paradigm. A paper on this subject was presented
at the 1989 Conference on Information Sciences and Systems at Johns Hopkins [5], and
another paper was presented as a contribution to an invited session on Neural Networks for
the 1989 IEEE Conference on Decision and Control in Tampa [6]. A fully distributed
version of the parallel learning algorithm, together with constraints that ensure (asymptotically)
that learning is not accomplished at the expense of good generalization capabilities, remains
for further investigation.

We also pursued our idea of connecting some notions from computational complexity
theory with limitations on performance of analog neural networks (as the particular case of
analog computing system of primary interest). Here our work was very much exploratory,
and we looked into neural network models where chaotic dynamics might play a necessary
role in allowing the computation of solutions corresponding to intractable, i.e., NP-complete,
combinatorial optimization problems. We also explored how even simpler combinatorial
problems can lead to problematical analog neural network “solution” methods. For
linearly-separable binary classification problems, using work by Sontag, we showed that the
discretization of continuous-time gradient learning algorithms with finite “margins,” which
converge in finite time, produces a polynomial-time algorithm. Some very simple
polynomial-time algorithms for trivial combinatorial problems such as matrix transposition using
Hopfield-type networks were obtained. A talk on our work relating computational
complexity and neural network behavior was presented as part of an invited session at the 1988 IEEE
Conference on Decision and Control in Austin [7]. Overall, we are not able to prove any
general results that would bear on the use of continuous-time neural networks for solving
combinatorial optimization problems. Perhaps it is the case that only combinatorial
problems solvable in linear time can be solved with such a form of analog computation (i.e., with
differential equations such as those describing a continuous-time neural network). Brockett’s
analog sorter, described by gradient dynamics on a manifold of orthogonal matrices, is an
example showing that n log n combinatorial problems need not have fast (i.e., polynomial time)
analog solutions.
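
To illustrate how a margin leads to a finite, bounded number of learning steps on a linearly separable problem, the sketch below runs the classical perceptron rule. This is a stand-in chosen here, not the discretized continuous-time algorithm referred to above, and the data generator, margin value, and names are assumptions. The number of updates is compared with Novikoff's bound (R / margin)², where R is the largest example norm.

```python
import numpy as np

def make_separable_data(num_points, margin, seed=0):
    """Sample 2-D points, keeping those at least `margin` from a known separating line."""
    rng = np.random.default_rng(seed)
    w_star = np.array([1.0, -1.0]) / np.sqrt(2.0)   # unit normal of the separating line
    points = rng.uniform(-1.0, 1.0, size=(num_points, 2))
    signed_distance = points @ w_star
    keep = np.abs(signed_distance) >= margin
    return points[keep], np.sign(signed_distance[keep])

def perceptron(points, labels, max_epochs=1000):
    """Classical perceptron rule; returns the weights and the number of updates made."""
    w = np.zeros(points.shape[1])
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if y * (w @ x) <= 0.0:      # misclassified (or on the boundary)
                w += y * x
                updates += 1
                mistakes += 1
        if mistakes == 0:               # a full pass with no errors: converged
            break
    return w, updates

MARGIN = 0.1
points, labels = make_separable_data(200, MARGIN)
w, updates = perceptron(points, labels)
radius = np.max(np.linalg.norm(points, axis=1))
update_bound = (radius / MARGIN) ** 2   # Novikoff's bound on the number of updates
```

The bound depends only on the margin and the scale of the data, not on the number of examples, so each pass through the data costs time linear in the sample size.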

To tackle learning problems within a satisfying mathematical framework, we studied
some important work on learning theory by Valiant (the PAC, or probably almost correct,
learning paradigm), and on performance bounds for multilayer perceptrons by Baum and
others where the useful notion of VC-dimension provides a means of characterizing
performance limitations. This work has suggested to us the need to investigate models where
examples are not drawn independently, since this is the case of practical interest in most
signal processing applications. Baum himself has looked into learning from examples and
queries, and the queries may be thought of as an extreme (highly beneficial) form of
dependence of examples. Our work dealt with a more “classical” statistical setting, motivated by
the process of sampling from a population without replacement.

Our method was to recast the known results on sample complexity for independent
sampling to explicitly account for dependence. A new version of the proof showing how
sample size grows with the VC-dimension of the class of candidate hypotheses in Valiant’s PAC
learning model was developed. The proof displays the properties of certain conditional
probabilities that assure that learning is at least as fast as for independent sampling. Basically,
the conditional probabilities must promote the sampling of a sufficiently rich set of examples,
and do so at least as well as examples generated by independent sampling. For the case of
examples generated as samples of low-dimensional Markov chains, an augmented state model
related to first return times may be used to compute the conditional probabilities that must
be checked to verify improved learning rates. Considerably more work is necessary to gain a
full understanding of many issues. It appears that a learning theory based on robust
statistical inference rather than nonparametric inference may be better able to provide answers to
practical questions about learning from dependent examples, and this remains as a topic for
future research.
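
The intuition that sampling without replacement yields at least as rich a set of examples as independent sampling can be checked with a small simulation. The sketch below counts distinct population members seen under each scheme; the population size, number of draws, and function names are assumptions chosen for illustration, and the count of distinct examples is only a crude proxy for the conditional-probability conditions in the proof.

```python
import numpy as np

def distinct_examples(population_size, num_draws, replace, seed=0):
    """Number of distinct population members seen in `num_draws` draws."""
    rng = np.random.default_rng(seed)
    draws = rng.choice(population_size, size=num_draws, replace=replace)
    return np.unique(draws).size

def average_distinct(population_size, num_draws, replace, trials=200):
    """Average the distinct-example count over several seeded trials."""
    counts = [distinct_examples(population_size, num_draws, replace, seed=s)
              for s in range(trials)]
    return float(np.mean(counts))
```

For a population of 100 and 50 draws, sampling without replacement always yields 50 distinct examples, while independent sampling repeats some members and so sees fewer on average.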

A list of publications prepared with the support of the grant follows, numbered
according to the citations in the preceding report of accomplishments. In addition, much of the
work described in the Ph.D. thesis of Dr. Sean M. Carroll, now an Assistant Professor in the
Department of Electrical Engineering at Tri-State University, Angola, Indiana, was
supported by the grant. The thesis, entitled Intelligent Least Squares Methods, was completed
in August, 1990. A copy of the Abstract of Dr. Carroll’s dissertation is attached to this
report.

Intelligent Nonlinear Least-Squares Methods


Sean Michael Carroll

Abstract

This thesis investigates some mathematical problems arising in the modeling of
input-output mappings. The investigation is motivated by the need to confront
directly the nonlinearities that arise in the construction of approximate models. After a
brief introduction in Chapter 1, Chapter 2 presents a model-simplification algorithm
for discrete-time linear systems by assuming that certain operators commute, then
invoking linear least-squares methods to produce near-optimal solutions. Chapters 3
and 4 investigate function approximation by means of neural networks, interconnected
systems of simple elements which are nonlinear and adaptive. Chapter 3 develops an
analytical, constructive approximation procedure applicable to a large class of
functions. This is accomplished by exploiting an analogy between an operator involved in
the inverse Radon Transform and the operation performed by a feedforward neural
network with a single hidden layer. The high-dimensional approximation problem
is reduced to a series of scalar approximation problems involving projections of the
original function onto lines. Chapter 4 describes the efficient solution of such scalar
approximation problems by noting the generally good performance of B-splines and
demonstrating that neural networks can be structured to behave in very similar ways.
Chapter 5 studies the problem of learning concepts from positive and negative
examples. When the examples are chosen randomly and independently, it is known how
many examples are needed; our work extends this result to the more general case of
correlated examples. The results are interpreted to show that the statistical
behavior agrees with intuition. We also formulate an alternative model of learning based
on statistical robustness rather than on the nonparametric theory already in use.
A concluding chapter reviews the techniques used and open problems uncovered in the
previous chapters.